# Dataset generation and organization

March 10 2026

# Datasets

**Note: these variable definitions were generated using a language model pointed at the relevant code as context. They were manually checked and modified by TS, but may still contain errors. Contact timsainb@gmail.com with any questions.**


# `DATASET.HDF5`
## This HDF5 file contains cross-recording information about syllables, calibrations, and recording metadata.


This file aggregates metadata, model outputs, morphology summaries, and references to individual recording-level files (`{RECORDING_ID}.HDF5`). The root contains several top-level groups and datasets described below.

---

## Root (`/`)

Top-level groups/datasets:
- `/calibrations` (group)
- `/model_free_syllables` (group)
- `/moseq_syllable_information` (group)
- `/pca` (group)
- `/phylo_tree` (dataset)
- `/recordings` (group)
- `/recordings_near_105_days` (dataset)
- `/skeleton` (group)
- `/smoseq_model_fits` (group)
- `/strain_info` (group)
- `/subject_size` (group)

---


## `/recordings_near_105_days` (dataset)

List of the recording nearest 105 days of age for each individual. Entries are recording identifiers (IDs).

- **Shape:** `(354,)`
- **Dtype:** object (`|O`)

---

## `/phylo_tree` (dataset)

Phylogenetic tree estimate from VertLife.org, stored as a Newick string.

- **Shape:** `()` (scalar)
- **Dtype:** object (`|O`)

---

## `/calibrations` (group)

Calibration recordings and parameters, taken at the beginning of a recording day for each rig and stored in gimbal and jarvis formats. See https://github.com/dattalab-6-cam/multicam-calibration/tree/main/multicam_calibration

---

## `/model_free_syllables` (group)

**Summary:**  
Model-free syllables are segmentation-derived behavioral "syllables" produced without fitting a MoSeq generative model. They are computed from PELT changepoint segmentation applied to combined kinematic and LLM-derived latent trajectories, and then characterized by per-syllable kinematic statistics, fixed-length pose/kinematic vectors, low-dimensional embeddings, neighborhood graphs, and cluster assignments. This group contains the per-syllable feature table, the per-syllable vector arrays used for dimensionality reduction and clustering, UMAP/KNN artifacts, and per-recording summary metrics.

**High-level definition / how they were computed**
- Segmentation: boundaries computed with PELT (ruptures) on a fused feature space combining:
  - continuous kinematic features (speed, body height, limb distances, curvature, etc.)
  - LLM/BERT latents (bert_latents) and their clipped absolute diffs
- Each PELT segment (between consecutive changepoints) is treated as a syllable.
- For each syllable we compute:
  - summary kinematic features (means, proportions for boolean heuristics)
  - a fixed-size vector representation created by temporally resampling/interpolating the per-frame combined feature matrix to `n_syll_timepoints` (typically 3)
- Dimensionality reduction and neighborhood structure:
  - PCA on flattened pose/latent maps (saved as `pose_pca`)
  - A balanced concatenation of pose PCs and z-scored continuous features (`final_features_flat`) was used as UMAP input
  - KNN (pynndescent) nearest neighbor indices and distances were computed for the final feature vectors
  - UMAP 2D embedding (`umap_2d`) saved for visualization
- Clustering:
  - Leiden clustering on an igraph built from KNN edges at multiple resolution parameters; cluster labels stored in the per-syllable table
- Important note about "model-free": these syllables are not MoSeq HHMM states. They are data-driven, changepoint-based segments. Mapping between the two is possible by aligning recording id and frame indices.
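
The fixed-size vector step above can be sketched as follows. This is an illustration under assumptions, not the pipeline's code: linear interpolation with `np.interp` is assumed (the text only says "resampling/interpolating"), and the function name `resample_segment` is hypothetical.

```python
import numpy as np

def resample_segment(features, n_timepoints=3):
    """Linearly resample a (n_frames, n_features) segment to n_timepoints rows."""
    features = np.asarray(features, dtype=float)
    n_frames = features.shape[0]
    src = np.linspace(0.0, 1.0, n_frames)      # original frame positions
    dst = np.linspace(0.0, 1.0, n_timepoints)  # resampled positions
    return np.column_stack(
        [np.interp(dst, src, features[:, j]) for j in range(features.shape[1])]
    )

# Toy 5-frame segment with two features
segment = np.column_stack([np.arange(5.0), 10.0 * np.arange(5.0)])
vec = resample_segment(segment, n_timepoints=3)  # rows at start, middle, end
flat = vec.ravel()                               # flattened per-syllable vector
```

Every syllable then yields a vector of the same length regardless of its duration, which is what PCA, KNN, and UMAP need as input.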

**Top-level contents (paths, example shapes & dtypes)**

- `/model_free_syllables/umap_2d`  
  - 2D embedding for visualization (UMAP).  
  - **Shape:** `(2319224, 2)`  
  - **Dtype:** `float32` (saved with attributes: `n_neighbors=10`, `min_dist=0.01`, `init="pca"`)

- `/model_free_syllables/pynndescent_knn_indices`  
  - KNN neighbor indices (from pynndescent nearest_neighbors).  
  - **Shape:** `(2319224, 10)` (example n_neighbors=10)  
  - **Dtype:** integer (e.g., `int32`)

- `/model_free_syllables/pynndescent_knn_dists`  
  - Corresponding neighbor distances.  
  - **Shape:** `(2319224, 10)`  
  - **Dtype:** float (e.g., `float32`)

- `/model_free_syllables/pose_pca`  
  - PCA projection of flattened pose/latent trajectory per syllable.  
  - **Shape:** `(2319224, n_pca_components)` (n_pca_components used during generation, e.g., 64)  
  - **Dtype:** `float32`  
  - **Attr:** `description = "PCA over flattened BERT trajectory (with 3 timepoints)"`

- `/model_free_syllables/final_features_flat`  
  - Final concatenated feature matrix used as input to UMAP. This is the weighted concat of pose PCA and the flattened z-scored continuous feature trajectories.  
  - **Shape:** `(2319224, D)` where `D = pose_pca.shape[1] + flattened_continuous_length`  
  - **Dtype:** `float32`  
  - **Attr:** description documenting the components and that it was used as UMAP input.

- `/model_free_syllables/syllable_feature_df_h5/` (group)  
  - A columnar HDF5 representation of the per-syllable pandas DataFrame. Each dataset listed below has length equal to the total syllable count: **2319224** (total syllables). All arrays are 1D with that length unless otherwise noted.

  Example column datasets (name → dtype / shape):
  - `video_recording` — `object` strings of recording IDs. **Shape:** `(2319224,)`  
  - `subject` — `object` strings. **Shape:** `(2319224,)`  
  - `start_idx` — `int32`. **Shape:** `(2319224,)`  
  - `end_idx` — `int32`. **Shape:** `(2319224,)`  
  - `changepoint_scores` — `float32`. **Shape:** `(2319224,)`  
  - `action` — `object` strings (heuristic action label: jump, walking, scrunch, pause, etc.). **Shape:** `(2319224,)`  
  - `is_jumping`, `is_moving`, `is_rearing`, `is_scrunching`, `is_short`, `is_turning_left`, `is_turning_right` — boolean (`|b1`). **Shape:** `(2319224,)`  
  - Continuous numeric kinematic features (examples):  
    - `acceleration`, `angular_velocity`, `angular_acceleration`, `body_height`, `max_body_position`, `min_body_position`, `distance_forepaw_to_nose`, `distance_tail_base_to_nose`, `forelimb_speed`, `hindlimb_speed`, `forelimb_acceleration`, `hindlimb_acceleration`, `limb_speed_diff`, `spine_curvature`, `min_hind_paw_height_diff`, `vertical_velocity`, `vertical_acceleration`, `speed` — **dtype:** `float32`, **shape:** `(2319224,)`
  - Leiden cluster labels at multiple resolutions: `leiden_clusters_0.25`, `leiden_clusters_0.5`, `leiden_clusters_1.0`, `leiden_clusters_2.0`, `leiden_clusters_4.0`, `leiden_clusters_8.0` — **dtype:** integer (`int16` or `int32` depending on export), **shape:** `(2319224,)`  
  - Any `object` dtype columns are stored as variable-length UTF-8 strings when written (`[str(i) for i in data]` converted prior to dataset creation).

  **Usage note:** The `video_recording`, `start_idx`, and `end_idx` fields are the canonical linkage back to recording-level time. To join a syllable to recording-level data or to MoSeq/HMM states, use `recordings/<video_recording>/...` and frame indices within that recording.

- `/model_free_syllables/individual_recordings/` (group)  
  - Per-recording summary metrics (N recordings = **739**). Datasets:
    - `n_sylls` — **Shape:** `(739,)`, **Dtype:** `int32`  
    - `video_recording` — **Shape:** `(739,)`, **Dtype:** object (recording IDs)  
    - `strain` — **Shape:** `(739,)`, **Dtype:** object  
    - `true_modularity`, `random_modularity`, `true_conductance`, `random_conductance` — **Shape:** `(739,)`, **Dtype:** `float32`  
  - These provide recording-level network/graph metrics computed on the per-syllable KNN graph as well as counts for validation and per-recording QC.

**Counts / totals**
- Total syllables: **2,319,224** (stored across `/model_free_syllables/*` datasets)  
- Number of recordings contributing syllables: **739** (per `/model_free_syllables/individual_recordings/video_recording`)

**How to map a model-free syllable to recording-level files**
- Each syllable row includes `video_recording` (string) and `start_idx`/`end_idx` (frame indices). To access the original per-frame data:
  - Open `/recordings/<video_recording>/` in the HDF5 and read frame arrays (for example `egocentric_coordinates_rigid`, `bert_latents` or continuous feature datasets).
  - `{RECORDING_ID}.HDF5` convention: `video_recording` string encodes the recording identifier used under `/recordings/`. Use the same string to open `recordings/<video_recording>` group in this aggregate HDF5 or to find the separate `{RECORDING_ID}.HDF5` recording-level file, if present. Frame indices are zero-based and align to those per-recording arrays.
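
A minimal sketch of this mapping, using toy NumPy arrays in place of the HDF5 datasets. Whether `end_idx` is inclusive or exclusive is not stated here, so the hypothetical helper `frames_for_syllable` exposes it as a flag (exclusive is assumed by default). With real data, recording IDs may need decoding from bytes before comparison.

```python
import numpy as np

def frames_for_syllable(per_frame, start_idx, end_idx, end_inclusive=False):
    """Slice a per-frame array down to one syllable's frames."""
    stop = int(end_idx) + 1 if end_inclusive else int(end_idx)
    return per_frame[int(start_idx):stop]

# Toy stand-ins for three columns of syllable_feature_df_h5
video_recording = np.array(["rec_a", "rec_a", "rec_b"], dtype=object)
start_idx = np.array([0, 40, 0], dtype=np.int32)
end_idx = np.array([40, 100, 25], dtype=np.int32)

# Toy stand-in for one recording's bert_latents, shape (# frames, 128)
bert_latents = np.zeros((100, 128), dtype=np.float32)

# Select the syllables belonging to one recording, then slice its frames
rows = np.flatnonzero(video_recording == "rec_a")
segments = [frames_for_syllable(bert_latents, start_idx[i], end_idx[i]) for i in rows]
segment_lengths = [len(s) for s in segments]
```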

---

## `/moseq_syllable_information` (group)

Syllable name (e.g. 'groom_4'), category (e.g. 'groom'), and ID number (e.g. 5) for each MoSeq syllable. Names and categories were assigned manually after model fitting.

---

## `/pca` (group)

Principal Component Analysis model fit to rigid egocentrically aligned keypoint positions (25 keypoints x 3D = 75 PCA dims). 

### `/pca/components_` (dataset)
- **Shape:** `(75, 75)`
- **Dtype:** `float32`

### `/pca/explained_variance_` (dataset)
- **Shape:** `(75,)`
- **Dtype:** `float32`

### `/pca/explained_variance_ratio_` (dataset)
- **Shape:** `(75,)`
- **Dtype:** `float32`

### `/pca/mean_` (dataset)
- **Shape:** `(75,)`
- **Dtype:** `float32`

### `/pca/singular_values_` (dataset)
- **Shape:** `(75,)`
- **Dtype:** `float64`
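
The dataset names follow the scikit-learn `PCA` attribute convention, so projecting keypoints would presumably look like the sketch below. Assumptions: no whitening, components stored as rows, and the `(25, 3)` keypoints flattened in row-major order; none of these is confirmed here. Random orthonormal components stand in for `/pca/components_`.

```python
import numpy as np

def project_keypoints(coords, mean_, components_, n_components=10):
    """Project (n_frames, 25, 3) keypoints onto the leading principal components."""
    flat = np.asarray(coords).reshape(len(coords), -1)   # (n_frames, 75)
    return (flat - mean_) @ components_[:n_components].T

rng = np.random.default_rng(0)
coords = rng.normal(size=(6, 25, 3)).astype(np.float32)
components_, _ = np.linalg.qr(rng.normal(size=(75, 75)))  # stand-in orthonormal basis
mean_ = coords.reshape(6, 75).mean(axis=0)

scores = project_keypoints(coords, mean_, components_, n_components=10)

# With all 75 components the projection is invertible
full = project_keypoints(coords, mean_, components_, n_components=75)
reconstructed = (full @ components_ + mean_).reshape(6, 25, 3)
```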

---

## `/skeleton` (group)

Defines the keypoint set and connectivity graph.  
Number of keypoints: **25**

### `/skeleton/keypoint_names` (dataset)
- **Shape:** `(25,)`
- **Dtype:** object (`|O`)

### `/skeleton/keypoint_colors` (dataset)
- **Shape:** `(25, 3)`
- **Dtype:** `int64`

### `/skeleton/skeleton_edges` (dataset)
- **Shape:** `(28, 2)`
- **Dtype:** object (`|O`)
- Each row is a pair of keypoint *names* (strings), one per edge endpoint.

### `/skeleton/skeleton_edge_colors` (dataset)
- **Shape:** `(28, 3)`
- **Dtype:** `int64`

Colormap to use for skeleton edges in plotting.
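
Because edges are stored as name pairs, plotting code usually needs them as index pairs into the keypoint axis. A small sketch with a hypothetical helper `edges_to_indices`; the names and edges below are illustrative stand-ins, not the file's actual 25 keypoints or 28 edges.

```python
import numpy as np

def edges_to_indices(keypoint_names, skeleton_edges):
    """Convert (n_edges, 2) name pairs into (n_edges, 2) integer index pairs."""
    def as_str(v):
        # HDF5 object strings may be read back as bytes
        return v.decode("utf-8") if isinstance(v, bytes) else str(v)

    index = {as_str(name): i for i, name in enumerate(keypoint_names)}
    return np.array(
        [[index[as_str(a)], index[as_str(b)]] for a, b in skeleton_edges],
        dtype=int,
    )

names = np.array([b"tail_base", b"spine_low", b"spine_mid", b"spine_high"], dtype=object)
edges = np.array(
    [["tail_base", "spine_low"], ["spine_low", "spine_mid"], ["spine_mid", "spine_high"]],
    dtype=object,
)
edge_idx = edges_to_indices(names, edges)
```

Row `i` of `edge_idx` can then be paired with row `i` of `skeleton_edge_colors` when drawing line segments between keypoints.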

---

## `/smoseq_model_fits` (group)

sMoSeq model fit statistics for 180 separate model runs (number of states, iteration number, likelihood on held-out data, mean state duration).

### `/smoseq_model_fits/held_out_likelihood` (dataset)
- **Shape:** `(180,)`
- **Dtype:** `float32`

### `/smoseq_model_fits/mean_state_duration` (dataset)
- **Shape:** `(180,)`
- **Dtype:** `float32`

### `/smoseq_model_fits/model_iteration` (dataset)
- **Shape:** `(180,)`
- **Dtype:** `int64`

### `/smoseq_model_fits/n_states` (dataset)
- **Shape:** `(180,)`
- **Dtype:** `int64`
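
One way these arrays might be used is to pick the best run for each state count. The sketch assumes the four datasets are aligned index by index (entry `i` of each describes the same run) and that a higher held-out likelihood is better. The helper name and toy values are illustrative.

```python
import numpy as np

def best_run_per_n_states(n_states, held_out_likelihood):
    """Return {n_states: index of the run with the highest held-out likelihood}."""
    best = {}
    for i, (k, ll) in enumerate(zip(n_states, held_out_likelihood)):
        k = int(k)
        if k not in best or ll > held_out_likelihood[best[k]]:
            best[k] = i
    return best

n_states = np.array([10, 10, 20, 20])
held_out_likelihood = np.array([-5.0, -4.0, -3.5, -3.9], dtype=np.float32)
best = best_run_per_n_states(n_states, held_out_likelihood)
```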

---

## `/strain_info` (group)

Strain taxonomy and visualization metadata.

- `/strain_info/strain_colormap` (group)
- `/strain_info/strain_to_genus` (group)
- `/strain_info/strain_to_species` (group)
- `/strain_info/strain_to_subspecies` (group)


---

## `/subject_size` (group)

Per-recording morphometric summaries.  
Number of recordings: **740**

### Distance summary datasets

Distances between keypoint pairs, in millimeters, computed from 3D keypoints. For each anatomical pair dataset `A-B`, there is a corresponding `A-B_std` giving the standard deviation across frames.

Examples (non-exhaustive; see file listing for full set):
- `forehead-left_ear`, `forehead-left_ear_std`
- `nose_tip-forehead`, `nose_tip-forehead_std`
- `spine_high-throat`, `spine_high-throat_std`
- `tail_base-spine_low`, `tail_base-spine_low_std`
- ... (many others)

All distance and distance-std datasets:
- **Shape:** `(# recordings,)`
- **Dtype:** `float32`


### Recording metadata datasets

- `/subject_size/strain`
  - **Shape:** `(# recordings,)`
  - **Dtype:** `|S8`

- `/subject_size/subject`
  - **Shape:** `(# recordings,)`
  - **Dtype:** `|S15`

- `/subject_size/video_recording_id`
  - **Shape:** `(# recordings,)`
  - **Dtype:** `|S24`


# `{RECORDING_ID}.HDF5` Files

## These HDF5 files contain frame-by-frame information (including keypoint coordinates) for each individual recording file.


## `Recording ID`: The recording time & Unique ID of the recording 

Conventions:
- `(# frames, ...)` refers to the time axis length for the recording.
- Units assume arena/egocentric coordinates are in **mm** and sampling rate is **120 Hz** (as used in feature computation).

---

## `/HHMM` (group)

Step-level MoSeq/HHMM annotations aligned to syllable segments.

- `start_idx`  
  Frame when the syllable begins.  
  **Shape:** `(# syllables,)`  
  **Dtype:** integer-like

- `end_idx`  
  Frame when the syllable ends.  
  **Shape:** `(# syllables,)`  
  **Dtype:** integer-like

- `moseq_syllable`  
  MoSeq syllable ID per segment.  
  **Shape:** `(# syllables,)`  
  **Dtype:** integer-like

- `smoseq_states_N`  
  HHMM/sMoSeq state identity for a model trained with `N` HHMM states.  
  **Shape:** `(1, # syllables)`  
  **Dtype:** integer-like
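
Segment-level labels can be expanded to one label per frame, for example to compare them with per-frame datasets. A sketch under the assumption that `end_idx` is exclusive (not stated here); the helper name is hypothetical. Note the leading singleton axis of `smoseq_states_N`, which is dropped with `[0]`.

```python
import numpy as np

def segments_to_frames(start_idx, end_idx, labels, n_frames, fill=-1):
    """Expand per-segment labels into a per-frame label array."""
    per_frame = np.full(n_frames, fill, dtype=np.int64)  # fill marks unlabeled frames
    for s, e, lab in zip(start_idx, end_idx, labels):
        per_frame[int(s):int(e)] = int(lab)
    return per_frame

start_idx = np.array([0, 3, 5])
end_idx = np.array([3, 5, 8])
smoseq_states_N = np.array([[2, 0, 1]])  # shape (1, # syllables)

per_frame = segments_to_frames(start_idx, end_idx, smoseq_states_N[0], n_frames=10)
```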

---

## `/steps` (group)

Per-step-cycle gait features. A “step” here is defined as the interval between **consecutive right hind paw (rhp) velocity peaks** during locomotion bouts.  
For this recording: **# steps = 2684**.

### Segmentation
- `index`  
  Step index for each detected step cycle (reference limb cycles).  
  **Shape:** `(# steps,)`  
  **Dtype:** int

- `start_frame`  
  Start frame of the step cycle (frame index of an rhp velocity peak).  
  **Shape:** `(# steps,)`  
  **Dtype:** int

- `end_frame`  
  End frame of the step cycle (next rhp velocity peak).  
  **Shape:** `(# steps,)`  
  **Dtype:** int

- `minima`  
  Frame index of the within-cycle minimum (rhp trough between the two peaks).  
  **Shape:** `(# steps,)`  
  **Dtype:** float or int (stored as float in listing)

### Timing
- `step_cycle`  
  Step cycle duration in frames: `end_frame - start_frame`.  
  **Shape:** `(# steps,)`  
  **Units:** frames  
  **Dtype:** float (per listing)

- `step_cycle_hz`  
  Step cycle frequency: `120 / step_cycle`.  
  **Shape:** `(# steps,)`  
  **Units:** Hz  
  **Dtype:** float

- `duty_cycle`  
  Fraction of cycle until the minima: `(minima - start_frame) / step_cycle`.  
  **Shape:** `(# steps,)`  
  **Units:** unitless  
  **Dtype:** float
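
A worked example of the three timing formulas above, using two made-up step cycles:

```python
import numpy as np

FPS = 120  # sampling rate in Hz

start_frame = np.array([100, 130])
end_frame = np.array([130, 154])
minima = np.array([112.0, 139.0])

step_cycle = (end_frame - start_frame).astype(float)  # frames: [30, 24]
step_cycle_hz = FPS / step_cycle                      # Hz: [4, 5]
duty_cycle = (minima - start_frame) / step_cycle      # [12/30, 9/24]
```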

### Inter-limb phase (within the rhp cycle)
Phases are normalized to `[0, 1]` within the `[start_frame, end_frame]` rhp cycle. If a matching limb event is not found within tolerance, phase is `NaN`.

- `rhp`  
  Reference limb phase, always `0` by definition.  
  **Shape:** `(# steps,)`  
  **Units:** unitless phase

- `lhp`  
  Left hind paw phase within the rhp cycle.  
  **Shape:** `(# steps,)`  
  **Units:** unitless phase

- `rfp`  
  Right fore paw phase within the rhp cycle.  
  **Shape:** `(# steps,)`  
  **Units:** unitless phase

- `lfp`  
  Left fore paw phase within the rhp cycle.  
  **Shape:** `(# steps,)`  
  **Units:** unitless phase

### Gait classification
- `gait_state`  
  String gait label derived from `duty_cycle` and phase relationships (e.g., walk/pace/trot/half-bound/full_bound/gallop).  
  **Shape:** `(# steps,)`  
  **Dtype:** string/object

- `arhmm_step_type`  
  Numeric step-type code from an ARHMM step model (if populated by a separate model).  
  **Shape:** `(# steps,)`  
  **Dtype:** float

---

## `/confidences_2d` (dataset)

Confidence scores for 2D keypoint predictions.  
**Shape:** `(# frames, 6, 25)`

---

## `/confidences_3d` (dataset)

Confidence scores for 3D keypoint predictions.  
**Shape:** `(# frames, 25)`

---

## `/predictions_2d` (dataset)

Predicted 2D keypoint coordinates in millimeters.  
**Shape:** `(# frames, 6, 25, 2)`

---

## `/predictions_3d` (dataset)

Predicted 3D keypoint coordinates in millimeters.  
**Shape:** `(# frames, 25, 3)`  
**Units:** typically mm

---

## `/gimbal` (dataset)

3D coordinates after Gimbal refinement.  
**Shape:** `(# frames, 25, 3)`  
**Units:** typically mm

---

## `/size_normalized_gimbal_keypoints` (dataset)

Gimbal-smoothed 3D keypoints, size-normalized to a reference mouse.  
**Shape:** `(# frames, 25, 3)`  
**Units:** typically mm

---

## `/bad_samples` (dataset)

Mask of low-quality or invalid samples.  
**Shape:** `(# frames,)`  
**Dtype:** bool

---

## `/arena_aligned_coordinates` (dataset)

3D keypoint coordinates expressed in an arena-fixed reference frame.  
Loaded from `coordinates_aligned.*.mmap`.  
**Shape:** `(# frames, 25, 3)`  
**Units:** typically mm

---

## `/egocentric_coordinates_rigid` (dataset)

3D keypoint coordinates expressed in a rigid egocentric frame.  
Loaded from `egocentric_alignment_rigid.*.mmap`.  
**Shape:** `(# frames, 25, 3)`  
**Units:** typically mm

---

## Continuous features (datasets)

All features below are computed at **120 Hz**.

Smoothing defaults (as used in the code):
- Many 1D traces use a **Savitzky–Golay** filter with `polyorder=8` and an effective `window_length=65` frames (the configured `window_length=64` is made odd).
- Some traces use `scipy.ndimage.uniform_filter1d` with:
  - `position_kernel_size_frames = 18` (150 ms at 120 Hz)
  - `limb_velocity_kernel_size_frames = 18` (150 ms at 120 Hz)
  - `spine_curvature_kernel_size_frames = 19` (150 ms at 120 Hz, forced odd)
  - `neck_angle_kernel_size_frames = 19` (150 ms at 120 Hz, forced odd)
- Heading uses complex-angle uniform smoothing with `heading_kernel_size_frames = 5` (33 ms at 120 Hz, forced odd).
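
The kernel sizes above are consistent with converting a duration to frames and, where noted, bumping even values to the next odd number. The rounding rule below is an assumption chosen because it reproduces the listed values; the function names are hypothetical.

```python
def make_odd(n):
    """Return n unchanged if odd, otherwise the next odd integer."""
    return n if n % 2 == 1 else n + 1

def kernel_size_frames(duration_ms, fps=120, force_odd=False):
    """Convert a window duration in milliseconds to a frame count."""
    n = int(round(duration_ms / 1000 * fps))
    return make_odd(n) if force_odd else n

position_kernel = kernel_size_frames(150)                 # 150 ms -> 18 frames
curvature_kernel = kernel_size_frames(150, force_odd=True)  # 18 -> 19
heading_kernel = kernel_size_frames(33, force_odd=True)     # 3.96 -> 4 -> 5
savgol_window = make_odd(64)                              # 64 -> 65
```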

### `/centroid_x` (dataset)
X position of the body centroid: `mean(arena_aligned_coordinates, axis=keypoints)[:, 0]`.  
**Shape:** `(# frames,)`  
**Units:** mm

### `/centroid_y` (dataset)
Y position of the body centroid: `mean(arena_aligned_coordinates, axis=keypoints)[:, 1]`.  
**Shape:** `(# frames,)`  
**Units:** mm

### `/max_body_position` (dataset)
Maximum z height above inferred floor.  
- `z = arena_aligned_coordinates[:, :, 2]`  
- `floor = median(min(z over keypoints) over frames)`  
- `z' = z - floor`  
- `max_body_position = max(z' over keypoints)`  
- Smoothed with Savitzky–Golay (`window_length=65`, `polyorder=8`).  
**Shape:** `(# frames,)`  
**Units:** mm

### `/min_body_position` (dataset)
Minimum z height above inferred floor, same floor correction as above.  
- `min_body_position = min(z' over keypoints)`  
- Savitzky–Golay smoothed (`window_length=65`, `polyorder=8`).  
**Shape:** `(# frames,)`  
**Units:** mm

### `/body_height` (dataset)
Body height estimate per frame: `max_body_position - min_body_position`, then Savitzky–Golay smoothed (`window_length=65`, `polyorder=8`).  
**Shape:** `(# frames,)`  
**Units:** mm
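
The floor correction shared by `max_body_position`, `min_body_position`, and `body_height` can be sketched as below. Savitzky–Golay smoothing is deliberately omitted, so these are the unsmoothed quantities; the function name is hypothetical.

```python
import numpy as np

def floor_corrected_extremes(coords):
    """Return (max, min) keypoint height above the inferred floor per frame.

    coords: (n_frames, n_keypoints, 3) arena-aligned coordinates.
    """
    z = coords[:, :, 2]
    floor = np.median(z.min(axis=1))   # median over frames of the lowest keypoint
    z_rel = z - floor
    return z_rel.max(axis=1), z_rel.min(axis=1)

# Toy data: 3 frames, 2 keypoints, only z is nonzero
coords = np.zeros((3, 2, 3))
coords[:, :, 2] = [[1.0, 5.0], [2.0, 6.0], [3.0, 9.0]]

max_pos, min_pos = floor_corrected_extremes(coords)  # floor = median([1, 2, 3]) = 2
body_height = max_pos - min_pos
```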

### `/speed` (dataset)
Planar centroid speed:
- `delta = ||centroid_xy[t+1] - centroid_xy[t]||`  
- `speed[t] = delta * 120`  
- Last frame padded with `0`  
- Savitzky–Golay smoothed (`window_length=65`, `polyorder=8`).  
**Shape:** `(# frames,)`  
**Units:** mm/s

### `/acceleration` (dataset)
Planar speed derivative:
- `acceleration = [0] + diff(speed)` (difference of the smoothed `speed`)  
- Savitzky–Golay smoothed (`window_length=65`, `polyorder=8`).  
**Shape:** `(# frames,)`  
**Units:** (mm/s) per frame  
Note: multiply by `120` to approximate mm/s².
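
A small unsmoothed sketch of the `speed` and `acceleration` definitions, including the padding and the per-frame to per-second conversion. The Savitzky–Golay steps are left out, so values will differ from the stored datasets.

```python
import numpy as np

FPS = 120

def planar_speed(centroid_xy, fps=FPS):
    """Frame-to-frame centroid displacement in mm/s, last frame padded with 0."""
    step = np.linalg.norm(np.diff(centroid_xy, axis=0), axis=1)
    return np.append(step * fps, 0.0)

def per_frame_derivative(trace):
    """First difference with a leading 0, in (units of trace) per frame."""
    return np.concatenate([[0.0], np.diff(trace)])

centroid_xy = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]])  # mm
speed = planar_speed(centroid_xy)            # 5 mm in one frame -> 600 mm/s
acceleration = per_frame_derivative(speed)   # (mm/s) per frame
acceleration_mm_s2 = acceleration * FPS      # approximate mm/s^2
```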

### `/vertical_velocity` (dataset)
Vertical velocity from `max_body_position`:
- `vertical_velocity[t] = (max_body_position[t+1] - max_body_position[t]) * 120`  
- Last frame padded with `0`  
- Savitzky–Golay smoothed (`window_length=65`, `polyorder=8`).  
**Shape:** `(# frames,)`  
**Units:** mm/s

### `/vertical_acceleration` (dataset)
Vertical velocity derivative:
- `vertical_acceleration = [0] + diff(vertical_velocity)` (difference of smoothed `vertical_velocity`)  
- Savitzky–Golay smoothed (`window_length=65`, `polyorder=8`).  
**Shape:** `(# frames,)`  
**Units:** (mm/s) per frame  
Note: multiply by `120` to approximate mm/s².

### `/heading` (dataset)
Heading angle in the arena XY plane:
- `head_point = mean(keypoints ['nose_tip','forehead','spine_high'])`  
- `tail_point = mean(keypoints ['spine_low','tail_base'])`  
- `direction = head_point - tail_point`  
- `heading = atan2(direction_y, direction_x)`  
- Smoothed by complex-angle uniform filtering with `kernel_size=5` frames.  
**Shape:** `(# frames,)`  
**Units:** radians in `[-π, π]`

### `/angular_velocity` (dataset)
Yaw angular velocity:
- `angular_velocity[t] = (unwrap(heading)[t+1] - unwrap(heading)[t]) * 120`  
- Last frame padded with `0`  
- Savitzky–Golay smoothed (`window_length=65`, `polyorder=8`).  
**Shape:** `(# frames,)`  
**Units:** rad/s
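
Heading is an angle, so it has to be averaged on the unit circle and unwrapped before differencing. The sketch below illustrates both steps. It uses `np.convolve` with edge padding as a stand-in for the uniform filter; the boundary handling in the original code is not documented here and may differ.

```python
import numpy as np

FPS = 120

def smooth_heading(heading, kernel_size=5):
    """Moving average of angles via their complex unit vectors."""
    z = np.exp(1j * np.asarray(heading, dtype=float))
    pad = kernel_size // 2
    z_padded = np.pad(z, pad, mode="edge")       # assumption: edge padding
    kernel = np.ones(kernel_size) / kernel_size
    return np.angle(np.convolve(z_padded, kernel, mode="valid"))

def angular_velocity(heading, fps=FPS):
    """Unwrapped heading difference in rad/s, last frame padded with 0."""
    return np.append(np.diff(np.unwrap(heading)) * fps, 0.0)

# Heading flickering across the +pi / -pi boundary
flicker = np.array([3.1, -3.1] * 5)
smoothed = smooth_heading(flicker)   # stays near +/-pi instead of averaging to 0

# A small turn that crosses the boundary between the last two frames
av = angular_velocity(np.array([3.0, 3.1, -3.1]))
```

Without `unwrap`, the last step in the second example would look like a jump of about -6.2 rad in one frame rather than a small positive turn.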

### `/angular_acceleration` (dataset)
Yaw angular acceleration proxy:
- `angular_acceleration = diff([0] + angular_velocity)` (difference of smoothed `angular_velocity`)  
- Savitzky–Golay smoothed (`window_length=65`, `polyorder=8`).  
**Shape:** `(# frames,)`  
**Units:** (rad/s) per frame  
Note: multiply by `120` to approximate rad/s².

### `/forelimb_speed` (dataset)
Mean forelimb keypoint speed:
- Compute frame-to-frame differences of arena-aligned coordinates, smooth those differences with `uniform_filter1d(size=18)` frames, multiply by `120`, then take L2 norm over xyz per keypoint.  
- Average speeds across forelimb keypoints:
  `['right_shoulder','right_elbow','right_wrist','right_fore_paw','left_shoulder','left_elbow','left_wrist','left_fore_paw']`  
- Savitzky–Golay smoothed (`window_length=65`, `polyorder=8`).  
**Shape:** `(# frames,)`  
**Units:** mm/s

### `/hindlimb_speed` (dataset)
Same computation as `forelimb_speed`, but averaged over hindlimb keypoints:
`['right_knee','right_hind_paw_back','right_hind_paw_front','left_knee','left_hind_paw_back','left_hind_paw_front']`.  
Savitzky–Golay smoothed (`window_length=65`, `polyorder=8`).  
**Shape:** `(# frames,)`  
**Units:** mm/s

### `/forelimb_acceleration` (dataset)
Derivative of `forelimb_speed`:
- `diff(forelimb_speed)` then Savitzky–Golay smoothed (`window_length=65`, `polyorder=8`)  
- Last frame padded by repeating the final value.  
**Shape:** `(# frames,)`  
**Units:** (mm/s) per frame  
Note: multiply by `120` to approximate mm/s².

### `/hindlimb_acceleration` (dataset)
Derivative of `hindlimb_speed`:
- `diff(hindlimb_speed)` then Savitzky–Golay smoothed (`window_length=65`, `polyorder=8`)  
- Last frame padded by repeating the final value.  
**Shape:** `(# frames,)`  
**Units:** (mm/s) per frame  
Note: multiply by `120` to approximate mm/s².

### `/limb_speed_diff` (dataset)
Forelimb minus hindlimb speed:
- `limb_speed_diff = forelimb_speed - hindlimb_speed`  
**Shape:** `(# frames,)`  
**Units:** mm/s

### `/distance_forepaw_to_nose` (dataset)
Distance from nose_tip to the nearer forepaw:
- `dL = ||nose_tip - left_fore_paw||`  
- `dR = ||nose_tip - right_fore_paw||`  
- `distance_forepaw_to_nose = min(dL, dR)`  
- Savitzky–Golay smoothed (`window_length=65`, `polyorder=8`).  
**Shape:** `(# frames,)`  
**Units:** mm

### `/distance_tail_base_to_nose` (dataset)
Euclidean distance between nose_tip and tail_base, arena frame.  
Savitzky–Golay smoothed (`window_length=65`, `polyorder=8`).  
**Shape:** `(# frames,)`  
**Units:** mm

### `/spine_curvature` (dataset)
Median 3D curvature of the trunk polyline per frame using trunk keypoints:
`['tail_base','spine_low','spine_mid','spine_high','forehead']`.
- For each frame, compute discrete curvature along the polyline using the 3D curvature formula  
  `κ = |r' × r''| / |r'|^3` (implemented via finite differences along the keypoint chain), then take the median curvature across segments.  
- Smooth with `uniform_filter1d(size=19)` frames.  
**Shape:** `(# frames,)`  
**Units:** 1/mm (if coordinates are mm)
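
A sketch of the curvature formula applied along the keypoint chain. The finite-difference scheme (`np.gradient` with unit spacing between keypoints) is an assumption; the original implementation may discretize differently. Temporal smoothing is omitted.

```python
import numpy as np

def median_spine_curvature(chain):
    """Median discrete curvature per frame.

    chain: (n_frames, n_points, 3), points ordered tail to head.
    """
    d1 = np.gradient(chain, axis=1)   # r'  along the chain
    d2 = np.gradient(d1, axis=1)      # r'' along the chain
    num = np.linalg.norm(np.cross(d1, d2), axis=-1)
    den = np.linalg.norm(d1, axis=-1) ** 3
    kappa = num / np.maximum(den, 1e-12)   # guard against zero-length tangents
    return np.median(kappa, axis=1)

# Frame 0: straight spine; frame 1: five points on a circular arc of radius 50 mm
straight = np.stack([np.arange(5.0) * 10, np.zeros(5), np.zeros(5)], axis=-1)
theta = np.linspace(0.0, 0.8, 5)
arc = np.stack([50 * np.cos(theta), 50 * np.sin(theta), np.zeros(5)], axis=-1)

curvature = median_spine_curvature(np.stack([straight, arc]))
```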

### `/extension_neck` (dataset)
Neck extension angle in **degrees** in egocentric coordinates:
- `head_center = mean(['nose_tip','forehead','left_eye','right_eye'])`  
- Compute joint angle at `spine_high` between vectors `(spine_mid -> spine_high)` and `(head_center -> spine_high)` via `arccos` of normalized dot product, then convert to degrees.  
- Smooth with `uniform_filter1d(size=19)` frames.  
**Shape:** `(# frames,)`  
**Units:** degrees
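
The joint-angle step can be sketched as below, with the two vectors taken exactly as described (both pointing toward `spine_high`). The function name is hypothetical and smoothing is omitted.

```python
import numpy as np

def joint_angle_deg(a, joint, b):
    """Angle in degrees between vectors (a -> joint) and (b -> joint)."""
    v1 = joint - a
    v2 = joint - b
    cos = np.sum(v1 * v2, axis=-1) / (
        np.linalg.norm(v1, axis=-1) * np.linalg.norm(v2, axis=-1)
    )
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))  # clip guards rounding error

spine_mid = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
spine_high = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
head_center = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0]])  # straight, then bent 90 degrees

extension_neck = joint_angle_deg(spine_mid, spine_high, head_center)
```

Under this convention a fully straight neck gives 180 degrees.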

### `/forelimb_hindlimb_height_diff` (dataset)
Relative paw height (forepaws vs hindpaws), arena z:
- `min_hindpaw_z = min(z of ['right_hind_paw_front','left_hind_paw_front'])`  
- `min_forepaw_z = min(z of ['right_fore_paw','left_fore_paw'])`  
- Each min-z trace is smoothed with `uniform_filter1d(size=18)` frames  
- `forelimb_hindlimb_height_diff = min_forepaw_z - min_hindpaw_z`  
- Savitzky–Golay smoothed (`window_length=65`, `polyorder=8`).  
**Shape:** `(# frames,)`  
**Units:** mm

### `/min_hind_paw_height_diff` (dataset)
Within-hindpaw height difference (back vs front), using **negated** arena z:
- Right: `right_diff = (-z[right_hind_paw_back]) - (-z[right_hind_paw_front])`  
- Left:  `left_diff  = (-z[left_hind_paw_back])  - (-z[left_hind_paw_front])`  
- Each z trace smoothed with `uniform_filter1d(size=18)` frames  
- Each diff Savitzky–Golay smoothed (`window_length=65`, `polyorder=8`)  
- `min_hind_paw_height_diff = min(left_diff, right_diff)`  
**Shape:** `(# frames,)`  
**Units:** mm (sign depends on the coordinate convention)

---


### Additional datasets
- `/bert_latents` — shape `(# frames, 128)`, the latent values from the BERT projection of egocentric coordinates
- `/meters_traveled` — shape `(# frames - 1,)`, the estimated distance (in meters) traveled by the mouse per frame
- `/moseq_syllable_distribution` — shape `(50,)`, the distribution of MoSeq syllables in that recording
- `/moseq_syllable_distribution_names` — shape `(50,)`, the name of the moseq syllables (human-annotated)
- `/moseq_syllable_latents` — shape `(# frames, 10)`, latents of the MoSeq model (PCs of egocentrics)
- `/moseq_syllables` — shape `(# frames,)`, MoSeq syllable IDs
- `/moseq_transition_matrix` — shape `(50, 50)`, a transition matrix inferred from MoSeq syllable sequence
- `/movement_states` — shape `(# frames,)`, movement state inferred from kinematic features
- `/smoothed_ethogram_umap_projections` — shape `(# sampled frames, 2)`, a projection of a smoothed ethogram (from model free syllables)
- `/smoothed_ethogram_umap_sample_numbers` — shape `(# sampled frames,)`, the frame sample number corresponding to the smoothed ethogram projection
- `/thigmotaxis` — shape `(# frames,)`, estimate of when the mouse is in thigmotaxis, generated from xy centroid of mouse
- `/umap_projections` — shape `(# syllables, 2)`, 2D UMAP projection of model-free syllables
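
For orientation, a transition matrix like `/moseq_transition_matrix` could be derived from a per-frame syllable sequence roughly as follows. How the stored matrix was normalized, and whether self-transitions were dropped, is not stated here. Row normalization and collapsing repeated frames are assumptions of this sketch.

```python
import numpy as np

def transition_matrix(syllables, n_syllables=50, drop_self=True):
    """Row-normalized transition probabilities from a syllable ID sequence."""
    seq = np.asarray(syllables, dtype=int)
    if drop_self and seq.size > 1:
        # Keep only the first frame of each run of identical IDs
        seq = seq[np.r_[True, seq[1:] != seq[:-1]]]
    counts = np.zeros((n_syllables, n_syllables))
    np.add.at(counts, (seq[:-1], seq[1:]), 1)
    rows = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

per_frame_ids = [0, 0, 1, 1, 2, 0, 1]   # collapses to 0, 1, 2, 0, 1
P = transition_matrix(per_frame_ids, n_syllables=3)
```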